Opened 15 years ago
Closed 9 years ago
#2864 closed enhancement (fixed)
ogr2ogr, shp > kml: broken XML, because of wrong encoding declaration
Reported by: peifer | Owned by: warmerdam
Priority: normal | Milestone:
Component: OGR_SF | Version: 1.6.0
Severity: normal | Keywords:
Cc:
Description
I see from Ticket #1494 that the issue is known. I just raise it again:
I am trying to convert shp files to KML and end up with *a lot* of broken XML files: ogr2ogr doesn't seem to make an effort to detect the dbf file's character encoding. It just dumps the attribute values into the KML file and declares the result to be UTF-8 encoded, which is wrong in most cases, at least over here in Europe, where people like to put a lot of accented characters into attribute values.
I am trying to fix this with a quick-and-dirty shell script that guesses the source encoding from the language driver identifier (byte 29 of the dbf file header) and then uses iconv to convert the attribute values to proper UTF-8.
I am just wondering: couldn't this be built into the shapefile driver, or am I missing something? (I would guess that most likely the latter is the case.)
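The byte-29 lookup described above can be sketched in a few lines. This is a minimal demo with a fabricated header; the file name fake.dbf is made up, and a real run would point at the shapefile's .dbf:

```shell
# Build a minimal fake 32-byte .dbf header whose Language Driver ID byte
# (offset 29, counting from 0) is 0x03, i.e. "Windows ANSI" = CP1252.
head -c 32 /dev/zero > fake.dbf
printf '\003' | dd of=fake.dbf bs=1 seek=29 conv=notrunc 2>/dev/null
# Read the LDID back the way a conversion script would:
LDID=$(hexdump -n1 -s29 -e '1/1 "%02x"' fake.dbf)
echo "LDID=0x$LDID"
# A CP1252 source would then be recoded with:  iconv -c -f CP1252 -t UTF-8
rm -f fake.dbf
```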
Change History (10)
follow-up: 2 comment:1 by , 15 years ago
The correct solution starts in the shapefile driver, but .dbf encoding handling is rather involved, and there is no one with the skills and time prepared to work on this issue at this time.
comment:2 by , 15 years ago
Replying to warmerdam:
...and there is no one with the skills and time prepared to work on this issue at this time.
OK. I see that it would then make sense to invest a bit more in my dirty encoding-guessing shell script. The basic approach: either I can make some sense of the dbf's byte 29, or I assume CP1252, which is a good guess these days.
Just in case someone would be interested in the results from converting ~1000 shp files, collected from all over Europe:
ogr2ogr, out of the box: ~50% well-formed XML, 0% valid XML
ogr2ogr, plus sh script: ~95% well-formed and valid XML
By the way: does the dbf driver, or some other piece of code, strip control characters that would be illegal character data in the KML/GML files? The quite popular non-breaking space comes to mind...
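On the control-character question, a possible pre-filter (my own sketch, not anything OGR does) is to drop the C0 bytes that XML 1.0 forbids, i.e. everything below 0x20 except tab, newline and carriage return. Note that U+00A0 (non-breaking space) itself is legal in XML; the real trouble is usually a raw 0xA0 byte in data mislabelled as UTF-8.

```shell
# Strip XML-illegal C0 control characters, keeping \t (011), \n (012), \r (015).
# tr ranges are octal: 000-010, 013, 014, 016-037.
printf 'foo\001bar\tbaz\n' | tr -d '\000-\010\013\014\016-\037'
# -> foo<TAB-preserved>bar	baz  (the \001 byte is removed)
```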
follow-up: 4 comment:3 by , 15 years ago
I am not aware of any logic in OGR that would alter string attribute characters in normal processing.
comment:4 by , 15 years ago
Replying to warmerdam:
I am not aware of any logic in OGR that would alter string attribute characters in normal processing.
It looks like <, > and & from my test.dbf file are properly escaped when appearing as character data in the KML file. This is good.
<SimpleData name="N7">&lt;</SimpleData>
<SimpleData name="N8">&gt;</SimpleData>
<SimpleData name="N9">&amp;</SimpleData>
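The escaping rule can be checked in isolation. This one-liner is my own sketch, not the driver's code; the order matters, as '&' must be replaced first:

```shell
# Escape the three XML metacharacters; '&' is substituted first, otherwise
# the '&' introduced by &lt;/&gt; would itself be escaped again.
printf '%s' '<>&' | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
# -> &lt;&gt;&amp;
```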
comment:5 by , 15 years ago
Here is what I am currently doing on top of ogr2ogr -f KML output in order to generate valid KML. This shell script is perhaps not too elegant, but it works in practice.
A hint from my testing: the encoding-guessing via the dbf file's header byte 29 seems to be unreliable, as not all shapefile-generating applications set this byte to an appropriate value. Nevertheless, recoding through iconv -c helps at least insofar as iconv strips all invalid characters.
#!/bin/bash
#
# Generate a valid KML file based on ogr2ogr conversion
# plus some dirty quick hacks for fixing typical errors
# Hermann, March 2009

# Check if we have a big shape file. Experience shows
# that file size kml file: ~3-4 * file size shape file.
# KML files > 100M might be too large for the user's PC
#
FILESIZE=$(stat -c%s "$1")
if [[ $FILESIZE -gt 30000000 ]]
then
    printf "%s\n" "KML file can not be generated: The shape file is too large ($FILESIZE bytes)" > /dev/stderr
    exit 1
fi

# Check if we have a prj file, which would be nice
#
if [[ -f "${1%shp}prj" || -f "${1%SHP}PRJ" ]]
then
    t_srs="-t_srs EPSG:4326"
else
    t_srs=
fi

# See if we have a dbf file and make a guess on its encoding, based on
# code pages listed in the ArcGIS v9, ArcPad Reference Guide
# http://downloads.esri.com/support/documentation/pad_/ArcPad_RefGuide_1105.pdf
#
if [[ ! -f "${1%shp}dbf" && ! -f "${1%SHP}DBF" ]]
then
    # Default value
    FROM_CODE=ASCII
else
    # Use the Language Driver Identifier (LGID), dbf file header, byte 29
    LGID=$( hexdump -n1 -s29 -C "${1%shp}dbf" | head -1 | awk '{print $2}' )
    # Translate LGID into a code page
    FROM_CODE=$( awk -v LGID="$LGID" 'BEGIN {
        CP["0x01"] = "CP437"   # U.S. MS-DOS
        CP["0x02"] = "CP850"   # International MS-DOS
        CP["0x03"] = "CP1252"  # Windows ANSI
        CP["0x08"] = "CP865"   # Danish OEM
        CP["0x09"] = "CP437"   # Dutch OEM
        CP["0x0A"] = "CP850"   # Dutch OEM*
        CP["0x0B"] = "CP437"   # Finnish OEM
        CP["0x0D"] = "CP437"   # French OEM
        CP["0x0E"] = "CP850"   # French OEM*
        CP["0x0F"] = "CP437"   # German OEM
        CP["0x10"] = "CP850"   # German OEM*
        CP["0x11"] = "CP437"   # Italian OEM
        CP["0x12"] = "CP850"   # Italian OEM*
        CP["0x13"] = "CP932"   # Japanese Shift-JIS
        CP["0x14"] = "CP850"   # Spanish OEM*
        CP["0x15"] = "CP437"   # Swedish OEM
        CP["0x16"] = "CP850"   # Swedish OEM*
        CP["0x17"] = "CP865"   # Norwegian OEM
        CP["0x18"] = "CP437"   # Spanish OEM
        CP["0x19"] = "CP437"   # English OEM (Britain)
        CP["0x1A"] = "CP850"   # English OEM (Britain)*
        CP["0x1B"] = "CP437"   # English OEM (U.S.)
        CP["0x1C"] = "CP863"   # French OEM (Canada)
        CP["0x1D"] = "CP850"   # French OEM*
        CP["0x1F"] = "CP852"   # Czech OEM
        CP["0x22"] = "CP852"   # Hungarian OEM
        CP["0x23"] = "CP852"   # Polish OEM
        CP["0x24"] = "CP860"   # Portuguese OEM
        CP["0x25"] = "CP850"   # Portuguese OEM*
        CP["0x26"] = "CP866"   # Russian OEM
        CP["0x37"] = "CP850"   # English OEM (U.S.)*
        CP["0x40"] = "CP852"   # Romanian OEM
        CP["0x4D"] = "CP936"   # Chinese GBK (PRC)
        CP["0x4E"] = "CP949"   # Korean (ANSI/OEM)
        CP["0x4F"] = "CP950"   # Chinese Big5 (Taiwan)
        CP["0x50"] = "CP874"   # Thai (ANSI/OEM)
        CP["0x57"] = "CP1252"  # ANSI
        CP["0x58"] = "CP1252"  # Western European ANSI
        CP["0x59"] = "CP1252"  # Spanish ANSI
        CP["0x64"] = "CP852"   # Eastern European MS-DOS
        CP["0x65"] = "CP866"   # Russian MS-DOS
        CP["0x66"] = "CP865"   # Nordic MS-DOS
        CP["0x67"] = "CP861"   # Icelandic MS-DOS
        CP["0x6A"] = "CP737"   # Greek MS-DOS (437G)
        CP["0x6B"] = "CP857"   # Turkish MS-DOS
        CP["0x6C"] = "CP863"   # French-Canadian MS-DOS
        CP["0x78"] = "CP950"   # Taiwan Big 5
        CP["0x79"] = "CP949"   # Hangul (Wansung)
        CP["0x7A"] = "CP936"   # PRC GBK
        CP["0x7B"] = "CP932"   # Japanese Shift-JIS
        CP["0x7C"] = "CP874"   # Thai Windows/MS-DOS
        CP["0x86"] = "CP737"   # Greek OEM
        CP["0x87"] = "CP852"   # Slovenian OEM
        CP["0x88"] = "CP857"   # Turkish OEM
        CP["0xC8"] = "CP1250"  # Eastern European Windows
        CP["0xC9"] = "CP1251"  # Russian Windows
        CP["0xCA"] = "CP1254"  # Turkish Windows
        CP["0xCB"] = "CP1253"  # Greek Windows
        CP["0xCC"] = "CP1257"  # Baltic Windows
        LGID = "0x" toupper(LGID)
        # Use code page if available, default = CP1252 = Windows ANSI
        print (LGID in CP ? CP[LGID] : "CP1252")
    }' )
fi

# Make ogr2ogr write to tmp file, transform to WGS84, if source SRS is known
#
ogr2ogr -f KML -skipfailures tmp.kml "$1" $t_srs 2>tmp.err

# Use ogr2ogr exit code and decide what to do
#
if [[ $? == 0 ]]
then
    # Remove Style elements from ogr2ogr output
    # in order to avoid schema validation errors
    grep -v Style tmp.kml |
    # Convert to UTF-8 encoding
    iconv -c -f $FROM_CODE -t UTF-8 |
    # Use an awk hack to fix the order of the Schema and Folder elements
    awk '
        NR == 3       { folder = orig = $0 ; sub("<Document>", "", folder) ; next }
        NR == 4       { print ( /Schema/ ? "<Document>" : orig ) ORS $0 ; next }
        /^<\/Schema>/ { print $0 ORS folder ; next }
                      { print }
    '
else
    # Return the error message
    cat tmp.err
fi

# Remove potential tmp file(s)
rm -f tmp.kml tmp.err
exit 0
follow-up: 7 comment:6 by , 15 years ago
See #2971, which forces XML output to ASCII if it is not valid UTF-8.
comment:7 by , 15 years ago
comment:8 by , 15 years ago
No, it's not the ultimate solution. Hopefully work will be done at some point in the shapefile driver so it can properly decode data from the DBF encoding to UTF-8. However, #2971 is not totally useless: it ensures that people using the OGR API directly (and not just ogr2ogr) don't feed XML-based drivers invalid data.
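The validity test behind that check can be approximated from the shell (a sketch, not GDAL's actual implementation): iconv run as a UTF-8-to-UTF-8 filter fails on the first invalid byte sequence.

```shell
# good.txt: "é" correctly encoded as UTF-8 (bytes 0xC3 0xA9)
printf '\303\251\n' > good.txt
# bad.txt: "é" as Latin-1 (lone 0xE9), which is not valid UTF-8
printf '\351\n' > bad.txt
for f in good.txt bad.txt; do
    if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1
    then echo "$f: valid UTF-8"
    else echo "$f: not UTF-8, would be forced to ASCII"
    fi
done
rm -f good.txt bad.txt
```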
comment:9 by , 9 years ago
Perhaps these schema errors have since been corrected; GDAL now carries the badge "OGC KML 2.2.0 (Official Reference Implementation)":
http://www.opengeospatial.org/resource/products/details/?pid=1218
comment:10 by , 9 years ago
Resolution: → fixed
Status: new → closed
The issue is more about shapefile recoding, which has been implemented since then. Closing as fixed.